e-Appendix C 


The E-M Algorithm 


The E-M algorithm is a convenient heuristic for likelihood maximization. The 
E-M algorithm will never decrease the likelihood. Our discussion will focus 
on mixture models, the GMM being a special case, even though the E-M 
algorithm applies to more general settings. 

Let $P_k(\mathbf{x};\theta_k)$ be a density for $k = 1,\ldots,K$, where $\theta_k$ are the
parameters specifying $P_k$. We will refer to each $P_k$ as a bump. In the GMM
setting, all the $P_k$ are Gaussians, and $\theta_k = \{\mu_k, \Sigma_k\}$ (the mean vector and
covariance matrix for each bump). A mixture model is a weighted sum of these
$K$ bumps,

$$P(\mathbf{x};\Theta) = \sum_{k=1}^{K} w_k P_k(\mathbf{x};\theta_k),$$

where the weights satisfy $w_k \ge 0$ and $\sum_{k=1}^{K} w_k = 1$ and we have collected all
the parameters into a single grand parameter, $\Theta = \{w_1,\ldots,w_K;\theta_1,\ldots,\theta_K\}$.
Intuitively, to generate a random point $\mathbf{x}$, you first pick a bump according to
the probabilities $w_1,\ldots,w_K$. Suppose you pick bump $k$. You then generate a
random point from the bump density $P_k$.

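Given data $X = \mathbf{x}_1,\ldots,\mathbf{x}_N$ generated independently, we wish to estimate
the parameters of the mixture which maximize the log-likelihood,

$$\begin{aligned}
\ln P(X|\Theta) &= \ln \prod_{n=1}^{N} P(\mathbf{x}_n|\Theta) \\
&= \ln \prod_{n=1}^{N} \left(\sum_{k=1}^{K} w_k P_k(\mathbf{x}_n;\theta_k)\right) \\
&= \sum_{n=1}^{N} \ln \left(\sum_{k=1}^{K} w_k P_k(\mathbf{x}_n;\theta_k)\right).
\end{aligned} \tag{C.1}$$
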
In the first step above, $P(X|\Theta)$ is a product because the data are independent.
Note that $X$ is known and fixed. What is not known is which particular bump
was used to generate data point $\mathbf{x}_n$. Denote by $j_n \in \{1,\ldots,K\}$ the bump
that generated $\mathbf{x}_n$ (we say $\mathbf{x}_n$ is a ‘member’ of bump $j_n$). Collect all bump
memberships into a set $J = \{j_1,\ldots,j_N\}$. If we knew which data belonged
to which bump, we could estimate each bump density’s parameters separately,
using only the data belonging to that bump. We call $(X,J)$ the complete
data. If we know the complete data, we can easily optimize the log-likelihood.
We call $X$ the incomplete data. Though $X$ is all we can measure, it is still
called the ‘incomplete’ data because it does not contain enough information
to easily determine the optimal parameters $\Theta^*$ which minimize $E_{\rm in}(\Theta)$. Let’s
see mathematically how knowing the complete data helps us.
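Although (C.1) is hard to maximize, it is easy to evaluate for any given parameters. The sketch below does so for a mixture of one-dimensional Gaussian bumps; the helper names and the toy data are assumptions made for illustration only.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian bump."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def incomplete_log_likelihood(points, weights, means, stds):
    """ln P(X | Theta) = sum_n ln( sum_k w_k P_k(x_n; theta_k) ), as in (C.1)."""
    total = 0.0
    for x in points:
        mix = sum(w * gaussian_pdf(x, m, s)
                  for w, m, s in zip(weights, means, stds))
        total += math.log(mix)
    return total

# Toy data with two visible clusters; compare a sensible and a poor parameter choice.
data = [-2.1, -1.9, 2.8, 3.3, 3.0]
good = incomplete_log_likelihood(data, [0.4, 0.6], [-2.0, 3.0], [0.5, 0.5])
bad = incomplete_log_likelihood(data, [0.4, 0.6], [0.0, 1.0], [0.5, 0.5])
```

The sum over bumps sits inside the logarithm, which is what couples all the parameters together and prevents a bump-by-bump maximization.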

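To get the likelihood of the complete data, we need the joint probability
$P[\mathbf{x}_n, j_n|\Theta]$. Using Bayes’ theorem,

$$P[\mathbf{x}_n, j_n|\Theta] = P[j_n|\Theta]\, P[\mathbf{x}_n|j_n,\Theta] = w_{j_n} P_{j_n}(\mathbf{x}_n;\theta_{j_n}).$$

Since the data are independent,

$$P(X,J|\Theta) = \prod_{n=1}^{N} P[\mathbf{x}_n, j_n|\Theta] = \prod_{n=1}^{N} w_{j_n} P_{j_n}(\mathbf{x}_n;\theta_{j_n}).$$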


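Let $N_k$ be the number of occurrences of bump $k$ in $J$, and let $X_k$ be those
data points corresponding to the bump $k$, so $X_k = \{\mathbf{x}_n \in X : j_n = k\}$. We
compute the log-likelihood for the complete data as follows:

$$\begin{aligned}
\ln P(X,J|\Theta) &= \sum_{n=1}^{N} \ln w_{j_n} + \sum_{n=1}^{N} \ln P_{j_n}(\mathbf{x}_n;\theta_{j_n}) \\
&= \sum_{k=1}^{K} N_k \ln w_k + \sum_{k=1}^{K} \underbrace{\sum_{\mathbf{x}_n \in X_k} \ln P_k(\mathbf{x}_n;\theta_k)}_{L_k(X_k;\theta_k)} \\
&= \sum_{k=1}^{K} N_k \ln w_k + \sum_{k=1}^{K} L_k(X_k;\theta_k).
\end{aligned} \tag{C.2}$$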


There are two simplifications which occur in (C.2) from knowing the complete
data $(X,J)$. The $w_k$ (in the first term) are separated from the $\theta_k$ (in the second
term); and, the second term is the sum of $K$ non-interacting log-likelihoods
$L_k(X_k;\theta_k)$ corresponding to the data belonging to $X_k$ and only involving bump
$k$’s parameters $\theta_k$. Each log-likelihood $L_k$ can be optimized independently of
the others. For many choices of $P_k$, $L_k(X_k;\theta_k)$ can be optimized analytically,
even though the log-likelihood for the incomplete data in (C.1) is intractable.
The next exercise asks you to analytically maximize (C.2) for the GMM.
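Exercise C.1

(a) Maximize the first term in (C.2) subject to $\sum_k w_k = 1$, and show that
    the optimal weights are $w_k^* = N_k/N$. [Hint: Lagrange multipliers.]

(b) For the GMM,

    $$P_k(\mathbf{x};\mu_k,\Sigma_k) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}} \exp\left(-\tfrac{1}{2}(\mathbf{x}-\mu_k)^{\rm T}\Sigma_k^{-1}(\mathbf{x}-\mu_k)\right).$$

    Maximize $L_k(X_k;\mu_k,\Sigma_k)$ to obtain the optimal parameters:

    $$\hat\mu_k = \frac{1}{N_k}\sum_{\mathbf{x}_n \in X_k} \mathbf{x}_n, \qquad
    \hat\Sigma_k = \frac{1}{N_k}\sum_{\mathbf{x}_n \in X_k} (\mathbf{x}_n - \hat\mu_k)(\mathbf{x}_n - \hat\mu_k)^{\rm T}.$$

    These are exactly the parameters you would expect. $\hat\mu_k$ is the
    in-sample mean for the data belonging to bump $k$; similarly, $\hat\Sigma_k$ is the
    in-sample covariance matrix.

    [Hint: Set $S_k = \Sigma_k^{-1}$ and optimize with respect to $S_k$. Also, from the
    Linear Algebra e-appendix, you may find these derivatives useful:

    $$\frac{\partial}{\partial S}\left(\mathbf{z}^{\rm T} S \mathbf{z}\right) = \mathbf{z}\mathbf{z}^{\rm T}; \quad\text{and}\quad \frac{\partial}{\partial S}\ln|S| = S^{-1}.]$$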
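To make the estimates in Exercise C.1 concrete, here is a minimal sketch of how they would be computed when the memberships $J$ are known. It is restricted to one-dimensional data, so the covariance matrix reduces to a scalar variance; the function name `complete_data_mle` and the toy data are hypothetical.

```python
def complete_data_mle(points, labels, K):
    """Per-bump maximum likelihood estimates when the memberships J are known (1-D case)."""
    N = len(points)
    weights, means, variances = [], [], []
    for k in range(K):
        # X_k: the data points whose membership is bump k (assumed non-empty).
        members = [x for x, j in zip(points, labels) if j == k]
        Nk = len(members)
        mean = sum(members) / Nk
        # In-sample variance: divides by N_k, not N_k - 1.
        var = sum((x - mean) ** 2 for x in members) / Nk
        weights.append(Nk / N)      # w_k = N_k / N
        means.append(mean)
        variances.append(var)
    return weights, means, variances

# Five points, the first two from bump 0 and the last three from bump 1.
w, mu, var = complete_data_mle([1.0, 3.0, 10.0, 12.0, 14.0], [0, 0, 1, 1, 1], K=2)
```

Each bump is fit using only its own points, which is exactly the decoupling that (C.2) exposes.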

In reality, we do not have access to $J$, and hence it is called a ‘hidden
variable’. So what can we do now? We need a heuristic to maximize the likelihood
in Equation (C.1). One approach is to guess $J$ and maximize the resulting
complete likelihood in Equation (C.2). This almost works. Instead of
maximizing the complete likelihood for a single guess, we consider an average of
the complete likelihood over all possible guesses. Specifically, we treat $J$ as
an unknown random variable and maximize the expected value (with respect
to $J$) of the complete log-likelihood in Equation (C.2). This expected value
is as easy to maximize as the complete likelihood. The mathematical
implementation of this idea will lead us to the E-M Algorithm, which stands for
Expectation-Maximization Algorithm. Let’s start with a simpler example.


Example C.1. You have two opaque bags. Bag 1 has red and green balls,
with $\mu_1$ being the fraction of red balls. Bag 2 has red and blue balls with $\mu_2$
being the fraction of red. You pick four balls in independent trials as follows.
First pick one of the bags at random, each with probability $\frac{1}{2}$; then, pick a ball
at random from the bag. Here is the sample of four balls you got:
one green, one red and two blue. The task is to estimate $\mu_1$ and $\mu_2$. It would
be much easier if we knew which bag each ball came from.

Here is one way to reason. Half the balls will come from Bag 1 and the
other half from Bag 2. The blue balls come from Bag 2 (that’s already Bag 2’s
budget of balls), so the other two should come from Bag 1:
{green, red} | {blue, blue}. Using
in-sample estimates, $\hat\mu_1 = \frac{1}{2}$ and $\hat\mu_2 = 0$. We can get these same estimates





© M Abu-Mostafa, Magdon-Ismail, Lin: Jan-2015 e-Chap:C-3 
using a maximum likelihood argument. The log-likelihood of the data is

$$\ln(1-\mu_1) + 2\ln(1-\mu_2) + \ln(\mu_1+\mu_2) - 4\ln 2. \tag{C.3}$$

The reader should explicitly maximize the above expression with respect to
$\mu_1, \mu_2 \in [0,1]$ to obtain the estimates $\hat\mu_1 = \frac{1}{2}$, $\hat\mu_2 = 0$. In our data set of four
balls, there is a red one, so it seems a little counter-intuitive that we would
estimate $\hat\mu_2 = 0$, for isn’t there a positive probability that the red ball came
from Bag 2? Nevertheless, $\hat\mu_2 = 0$ is the estimate that maximizes the
probability of generating the data. You are uneasy with this, and rightly so, because
we put all our eggs into this single ‘point’ estimate; a very unnatural thing
given that any point estimate has infinitesimal probability of being correct.
Nevertheless, that is the maximum likelihood method, and we are following it.
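The maximization of (C.3) can also be checked numerically. The sketch below uses a plain grid search, which is only a sanity check and not the analytic argument the text asks for; the grid resolution is an arbitrary choice.

```python
import math

def log_likelihood(mu1, mu2):
    """Incomplete-data log-likelihood (C.3): one green, one red, two blue balls."""
    return (math.log(1 - mu1) + 2 * math.log(1 - mu2)
            + math.log(mu1 + mu2) - 4 * math.log(2))

# Grid over [0, 1) x [0, 1); the point (0, 0) is skipped because ln(0) is undefined.
steps = 200
grid = [i / steps for i in range(steps)]   # 0, 0.005, ..., 0.995
best = max(((m1, m2) for m1 in grid for m2 in grid if m1 + m2 > 0),
           key=lambda p: log_likelihood(*p))
```

The grid maximizer `best` lands on the boundary $\mu_2 = 0$, in agreement with the estimates quoted above.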

Here is another way to reason. ‘Half’ of each red ball came from Bag 1
and the other ‘half’ from Bag 2. So, $\hat\mu_1 = \frac{1}{2}/(1+\frac{1}{2}) = \frac{1}{3}$ because half a red ball
came from Bag 1 out of a total of $1+\frac{1}{2}$ balls. Similarly, $\hat\mu_2 = \frac{1}{2}/(2+\frac{1}{2}) = \frac{1}{5}$.
This reasoning is wrong because it does not correctly use the knowledge that
the ball is red. For example, as we just reasoned, $\hat\mu_1 = \frac{1}{3}$ and $\hat\mu_2 = \frac{1}{5}$. But, if
these estimates are correct, and indeed $\hat\mu_1 > \hat\mu_2$, then a red ball is more likely
to come from Bag 1, so more than half of it should come from Bag 1. This
contradicts the original assumption that led us to these estimates.

Now, let’s see how expectation-maximization solves this problem. The
reasoning is similar to our false start above; it just adds iteration till consistency.
We begin by considering the two cases for the red ball. Either it is from Bag 1
or Bag 2. We can compute the log-likelihood for each of these two cases:

$$\begin{aligned}
&\ln(1-\mu_1) + \ln(\mu_1) + 2\ln(1-\mu_2) - 4\ln 2 \qquad \text{(Bag 1)}; \\
&\ln(1-\mu_1) + \ln(\mu_2) + 2\ln(1-\mu_2) - 4\ln 2 \qquad \text{(Bag 2)}.
\end{aligned}$$

Suppose we have estimates $\hat\mu_1$ and $\hat\mu_2$. Using Bayes’ theorem, we can compute
$p_1 = P[\text{Bag 1}\,|\,\hat\mu_1,\hat\mu_2]$ and $p_2 = P[\text{Bag 2}\,|\,\hat\mu_1,\hat\mu_2]$. The reader can verify that

$$p_1 = \frac{\hat\mu_1}{\hat\mu_1+\hat\mu_2}, \qquad p_2 = \frac{\hat\mu_2}{\hat\mu_1+\hat\mu_2}.$$

Now comes the expectation step. Compute the expected log-likelihood using
$p_1$ and $p_2$:

$$E[\text{log-likelihood}] = \ln(1-\mu_1) + p_1\ln(\mu_1) + p_2\ln(\mu_2) + 2\ln(1-\mu_2) - 4\ln 2. \tag{C.4}$$














Next comes the maximization step. Treating $p_1, p_2$ as constants, maximize
the expected log-likelihood with respect to $\mu_1, \mu_2$ and update $\hat\mu_1, \hat\mu_2$ to these
optimal values. Notice that the log-likelihood in (C.3) has an interaction
term $\ln(\mu_1+\mu_2)$ which complicates the maximization. In the expected
log-likelihood (C.4), $\mu_1$ and $\mu_2$ are decoupled, and so the maximization can be
implemented separately for each variable. The reader can verify that
maximizing the expected log-likelihood gives the updates:

$$\hat\mu_1 \leftarrow \frac{p_1}{1+p_1} = \frac{\hat\mu_1}{2\hat\mu_1+\hat\mu_2}
\qquad\text{and}\qquad
\hat\mu_2 \leftarrow \frac{p_2}{2+p_2} = \frac{\hat\mu_2}{2\hat\mu_1+3\hat\mu_2}.$$
The full algorithm just iterates this update process with the new estimates.
Let’s see what happens if we start (arbitrarily) with estimates $\hat\mu_1 = \hat\mu_2 = \frac{1}{2}$.

Iteration number   0     1     2     3     4     5     6     7     ...   1000
$\hat\mu_1$        1/2   1/3   0.38  0.41  0.43  0.45  0.45  0.46  ...   0.49975
$\hat\mu_2$        1/2   1/5   0.16  0.13  0.10  0.09  0.07  0.07  ...   0.0005

The result of the first iteration (iteration 1 in the table) is exactly the
estimate from our earlier faulty reasoning. When $\hat\mu_1 = \hat\mu_2$ our faulty reasoning
matches the E-M step. If we continued this table, it is not hard to see what
will happen: $\hat\mu_1 \to \frac{1}{2}$ and $\hat\mu_2 \to 0$.
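The iteration in the table is short enough to reproduce directly. The following sketch applies the E-step and M-step formulas of Example C.1 repeatedly; the function name `em_step` and the choice to store the whole history are illustrative.

```python
def em_step(mu1, mu2):
    """One E-M update for Example C.1 (one green, one red, two blue balls)."""
    # E-step: posterior probability that the red ball came from each bag.
    p1 = mu1 / (mu1 + mu2)
    p2 = mu2 / (mu1 + mu2)
    # M-step: maximize the expected log-likelihood (C.4) in each variable separately.
    return p1 / (1 + p1), p2 / (2 + p2)

# Start (arbitrarily) from mu1 = mu2 = 1/2 and iterate 1000 times.
history = [(0.5, 0.5)]
for _ in range(1000):
    history.append(em_step(*history[-1]))

for t in (0, 1, 2, 3, 1000):
    print(t, round(history[t][0], 5), round(history[t][1], 5))
```

The slow drift of the second estimate toward zero is visible in the printed values: many iterations are needed before it becomes small.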


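Exercise C.2

When the E-M algorithm converges, it must be that

$$\hat\mu_1 = \frac{\hat\mu_1}{2\hat\mu_1+\hat\mu_2} \qquad\text{and}\qquad \hat\mu_2 = \frac{\hat\mu_2}{2\hat\mu_1+3\hat\mu_2}.$$

Solve these consistency conditions, and report your estimates for $\hat\mu_1, \hat\mu_2$.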


It’s miraculous that by maximizing an expected log-likelihood using a guess for
the parameters, we end up converging to the true maximum likelihood solution.
Why is this useful? Because the maximizations for $\mu_1$ and $\mu_2$ are decoupled.
We trade a maximization of a complicated likelihood of the incomplete data
for a bunch of simpler maximizations that we iterate.














C.1 Derivation of the E-M Algorithm 


We now derive the E-M strategy and show that it will always improve the 
likelihood of the incomplete data. 


Maximizing an expected log-likelihood of the complete data increases 
the likelihood of the incomplete data. 


What is surprising is that to compute the expected log-likelihood, we use a 
guess, since we don’t know the best model. So it is really a guess for the 
expected log-likelihood that one maximizes and this increases the likelihood. 

Let $\Theta'$ be any set of parameters, and define $P(J|X,\Theta')$ as the conditional
probability distribution for $J$ given the data and assuming that $\Theta'$ is the actual
probability model for the data. The probability $P(J|X,\Theta')$ is well defined,
even if $\Theta'$ is not the probability model which generated the data. We will see
how to compute $P(J|X,\Theta')$ soon, but for the moment assume that it is always
positive (which means that every possible assignment of $\mathbf{x}_1,\ldots,\mathbf{x}_N$ to bumps
$j_1,\ldots,j_N$ has non-zero probability under $\Theta'$). This will always be the case
unless some of the $P_k$ have bounded support or $\Theta'$ is some degenerate mixture.
The following derivation establishes a connection between the log-likelihood
for the incomplete data and the expectation (over $J$) of the log-likelihood for
the complete data.


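$$\begin{aligned}
L(\Theta) &= \ln P(X|\Theta) \\
&\overset{(a)}{=} \ln \sum_J P(X,J|\Theta) \\
&\overset{(b)}{=} \ln \sum_J \frac{P(X,J|\Theta)}{P(J|X,\Theta')}\, P(J|X,\Theta') \\
&\overset{(c)}{=} \ln \sum_J \frac{P(X,J|\Theta)\,P(X|\Theta')}{P(X,J|\Theta')}\, P(J|X,\Theta') \\
&\overset{(d)}{\ge} \sum_J \ln\left(\frac{P(X,J|\Theta)\,P(X|\Theta')}{P(X,J|\Theta')}\right) P(J|X,\Theta') \\
&\overset{(e)}{=} L(\Theta') + E_{J|X,\Theta'}[\ln P(X,J|\Theta)] - E_{J|X,\Theta'}[\ln P(X,J|\Theta')] \\
&\overset{(f)}{=} L(\Theta') + Q(\Theta|X,\Theta') - Q(\Theta'|X,\Theta').
\end{aligned} \tag{C.5}$$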





(a) follows by the law of total probability; (b) is justified because $P(J|X,\Theta')$ is
positive; (c) follows from Bayes’ theorem; (d) follows because the summation
$\sum_J (\cdot)P(J|X,\Theta')$ is an expectation using the probabilities $P(J|X,\Theta')$ and
by Jensen’s inequality $\ln E[\cdot] \ge E[\ln(\cdot)]$ because $\ln(x)$ is concave; (e) follows
because $\ln P(X|\Theta')$ is independent of $J$ and so

$$\sum_J P(J|X,\Theta') \ln P(X|\Theta') = \ln P(X|\Theta') \sum_J P(J|X,\Theta') = \ln P(X|\Theta') \cdot 1;$$

finally, in (f), we have defined the function

$$Q(\Theta|X,\Theta') = E_{J|X,\Theta'}[\ln P(X,J|\Theta)].$$











The function $Q$ is a function of $\Theta$, though its definition depends on the
distribution of $J$ which in turn depends on the incomplete data $X$ and the model $\Theta'$.
We have proved the following result.


Theorem C.2. If $Q(\Theta|X,\Theta') > Q(\Theta'|X,\Theta')$, then $L(\Theta) > L(\Theta')$.
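The bound (C.5) behind the theorem can be sanity-checked numerically on Example C.1, where both $L$ and $Q$ have the closed forms (C.3) and (C.4). The sketch below draws random parameter pairs and counts violations of $L(\Theta) \ge L(\Theta') + Q(\Theta|X,\Theta') - Q(\Theta'|X,\Theta')$; the sampling range and the tolerance are arbitrary choices made for illustration.

```python
import math
import random

def L(mu1, mu2):
    """Incomplete-data log-likelihood (C.3)."""
    return (math.log(1 - mu1) + 2 * math.log(1 - mu2)
            + math.log(mu1 + mu2) - 4 * math.log(2))

def Q(mu, mu_prime):
    """Expected complete-data log-likelihood (C.4); the posterior is computed from mu_prime."""
    p1 = mu_prime[0] / (mu_prime[0] + mu_prime[1])
    p2 = 1 - p1
    mu1, mu2 = mu
    return (math.log(1 - mu1) + p1 * math.log(mu1) + p2 * math.log(mu2)
            + 2 * math.log(1 - mu2) - 4 * math.log(2))

rng = random.Random(0)
violations = 0
for _ in range(10000):
    mu = (rng.uniform(0.01, 0.99), rng.uniform(0.01, 0.99))
    mu_prime = (rng.uniform(0.01, 0.99), rng.uniform(0.01, 0.99))
    lower_bound = L(*mu_prime) + Q(mu, mu_prime) - Q(mu_prime, mu_prime)
    if L(*mu) < lower_bound - 1e-12:   # small slack for floating-point error
        violations += 1
```

A random search cannot prove the inequality, but a count of zero violations is what the derivation predicts.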


In words, fix $\Theta'$ and compute the ‘posterior’ distribution of $J$ conditioned on
the data $X$ with parameters $\Theta'$. Now for the parameters $\Theta$, compute the
expected log-likelihood of the complete data $(X,J)$ where the expectation is
taken with respect to this posterior distribution for $J$ that we just obtained.
This distribution for $J$ is fixed, depending on $X, \Theta'$, but it does not depend
on $\Theta$. Find $\Theta^*$ that maximizes this expected log-likelihood, and you are
guaranteed to improve the actual likelihood. This theorem leads naturally to
the E-M algorithm that follows.
E-M Algorithm
1: Initialize $\Theta_0$ at $t = 0$.
2: At step $t$ let the parameter estimate be $\Theta_t$.
3: [Expectation] For $X, \Theta_t$, compute the function $Q_t(\Theta)$:

   $$Q_t(\Theta) = E_{J|X,\Theta_t}[\ln P(X,J|\Theta)],$$

   which is the expected log-likelihood for the complete data.
4: [Maximization] Update $\Theta$ to maximize $Q_t(\Theta)$:

   $$\Theta_{t+1} = \mathop{\rm argmax}_{\Theta}\; Q_t(\Theta).$$

5: Increment $t \leftarrow t+1$ and repeat steps 3,4 till convergence.
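The loop in the algorithm box can be sketched generically, with the problem-specific expectation and maximization supplied as functions. Everything about the interface below is an assumption made for illustration: the name `expectation_maximization`, representing $\Theta$ as a tuple of numbers, and declaring convergence when no coordinate changes by more than `tol`.

```python
def expectation_maximization(theta0, e_step, m_step, max_iter=100000, tol=1e-8):
    """Generic E-M loop (assumes max_iter >= 1).

    e_step(theta) returns the summary of P(J | X, theta) that defines Q_t;
    m_step(summary) returns the theta that maximizes Q_t.
    """
    theta = tuple(theta0)
    for n_iter in range(1, max_iter + 1):
        posterior = e_step(theta)              # step 3: fix the distribution of J
        new_theta = tuple(m_step(posterior))   # step 4: theta_{t+1} = argmax Q_t
        change = max(abs(a - b) for a, b in zip(new_theta, theta))
        theta = new_theta
        if change < tol:                       # step 5: stop at numerical convergence
            break
    return theta, n_iter

# Usage on Example C.1: theta = (mu1, mu2).
def e_step(theta):
    mu1, mu2 = theta
    return mu1 / (mu1 + mu2), mu2 / (mu1 + mu2)

def m_step(posterior):
    p1, p2 = posterior
    return p1 / (1 + p1), p2 / (2 + p2)

theta, n_iter = expectation_maximization((0.5, 0.5), e_step, m_step)
```

A small change between successive iterates is a heuristic stopping rule only; when progress is slow, as in this example, the loop may stop while the estimate is still some distance from its limit.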





In the algorithm, we need to compute $Q_t(\Theta)$, which amounts to computing an
expectation with respect to $P(J|X,\Theta_t)$. We illustrate this process with our
mixture model.

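Recall that $J$ is the vector of bump memberships. Since the data are
independent, to compute $P(J|X,\Theta_t)$ we can compute this ‘posterior’ for each
data point and then take the product. We need $\gamma_{nk} = P(j_n = k|\mathbf{x}_n,\Theta_t)$, the
probability that data point $\mathbf{x}_n$ came from bump $k$. By Bayes’ theorem,

$$\gamma_{nk} = P(j_n = k|\mathbf{x}_n,\Theta_t) = \frac{P(\mathbf{x}_n,k|\Theta_t)}{P(\mathbf{x}_n|\Theta_t)} = \frac{\hat w_k P_k(\mathbf{x}_n|\hat\theta_k)}{\sum_{\ell=1}^{K}\hat w_\ell P_\ell(\mathbf{x}_n|\hat\theta_\ell)},$$

where $\Theta_t = \{\hat w_1,\ldots,\hat w_K;\hat\theta_1,\ldots,\hat\theta_K\}$ and $P(J|X,\Theta_t) = \prod_{n=1}^{N}\gamma_{n j_n}$. We can
now compute $Q_t(\Theta)$,

$$Q_t(\Theta) = E_J[\ln P(X,J|\Theta)],$$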











where the expectation is with respect to the (‘fictitious’) probabilities $\gamma_{nk}$
that determine the distribution of the random variable $J$. These probabilities
depend on $X$ and $\Theta_t$. Let $N_k$ (a random variable) be the number of occurrences
of bump $k$ in the random variable $J$; similarly, let $X_k$ be the random set
containing the data points of bump $k$. From Equation (C.2),


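$$\begin{aligned}
Q_t(\Theta) &= \sum_{k=1}^{K} E_J[N_k]\ln w_k + \sum_{k=1}^{K} E_J\left[\sum_{\mathbf{x}_n \in X_k}\ln P_k(\mathbf{x}_n;\theta_k)\right] \\
&\overset{(a)}{=} \sum_{k=1}^{K} E_J\left[\sum_{n=1}^{N} z_{nk}\right]\ln w_k + \sum_{k=1}^{K} E_J\left[\sum_{n=1}^{N} z_{nk}\ln P_k(\mathbf{x}_n;\theta_k)\right] \\
&\overset{(b)}{=} \sum_{k=1}^{K} N_k\ln w_k + \sum_{k=1}^{K}\sum_{n=1}^{N}\gamma_{nk}\ln P_k(\mathbf{x}_n;\theta_k),
\end{aligned}$$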
where $\Theta = \{w_1,\ldots,w_K;\theta_1,\ldots,\theta_K\}$. In (a), we introduced an indicator
random variable $z_{nk} = [\mathbf{x}_n \in X_k]$ which is 1 if $\mathbf{x}_n$ is from bump $k$ and 0 otherwise;
in (b), we defined

$$N_k = E_J[N_k] = E_J\left[\sum_{n=1}^{N} z_{nk}\right] = \sum_{n=1}^{N}\gamma_{nk},$$

where we used $E[z_{nk}] = \gamma_{nk}$ (overloading notation, $N_k$ on the left denotes this
expected count from here on). Now that we have an explicit functional form
for $Q_t(\Theta)$, we can perform the maximization step. Observe that the
bump-parameters $\theta_k \in \Theta$ are occurring in independent terms, and so can be
optimized separately. As for the first term, observe that








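$$\begin{aligned}
\frac{1}{N}\sum_{k=1}^{K} N_k\ln w_k &= \sum_{k=1}^{K}\frac{N_k}{N}\ln\frac{w_k}{N_k/N} + \sum_{k=1}^{K}\frac{N_k}{N}\ln\frac{N_k}{N} \\
&\le \sum_{k=1}^{K}\frac{N_k}{N}\ln\frac{N_k}{N},
\end{aligned}$$

where the last inequality follows from Jensen’s inequality and the concavity of
the logarithm, which implies:

$$\sum_{k=1}^{K}\frac{N_k}{N}\ln\frac{w_k}{N_k/N} \le \ln\left(\sum_{k=1}^{K}\frac{N_k}{N}\cdot\frac{w_k}{N_k/N}\right) = \ln\left(\sum_{k=1}^{K} w_k\right) = \ln(1) = 0.$$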


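Equality holds when $w_k = N_k/N$. Maximizing $Q_t(w_1,\ldots,w_K,\theta_1,\ldots,\theta_K)$,
therefore, gives the following updates:

$$w_k^* = \frac{N_k}{N} = \frac{1}{N}\sum_{n=1}^{N}\gamma_{nk}; \tag{C.6}$$

$$\theta_k^* = \mathop{\rm argmax}_{\theta_k}\sum_{n=1}^{N}\gamma_{nk}\ln P_k(\mathbf{x}_n;\theta_k). \tag{C.7}$$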


The update to get $w_k^*$ can be viewed as follows. A fraction $\gamma_{nk}$ of data point $\mathbf{x}_n$
belongs to bump $k$. Thus, $N_k = \sum_n \gamma_{nk}$ is the total number of data points
belonging to bump $k$. Similarly, the update to get $\theta_k^*$ can be viewed as follows.
The ‘log-likelihood’ for the parameter $\theta_k$ of bump $k$ is just the weighted sum of the
log-likelihoods of each point, weighted by the fraction of the point that belongs to
bump $k$. This intuitive interpretation of the update is another reason for the
popularity of the E-M algorithm.
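The ‘fractional membership’ view can be made concrete with a few lines of code. The sketch below computes the posterior bump probabilities $\gamma_{nk}$ and the weight update (C.6), assuming one-dimensional Gaussian bumps; the function names and the toy data are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian bump."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def responsibilities(points, weights, means, stds):
    """gamma[n][k] = w_k P_k(x_n) / sum_l w_l P_l(x_n): fraction of point n owned by bump k."""
    gamma = []
    for x in points:
        joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
        total = sum(joint)
        gamma.append([j / total for j in joint])
    return gamma

def update_weights(gamma):
    """(C.6): w_k = N_k / N, where N_k = sum_n gamma[n][k] is a fractional count."""
    N, K = len(gamma), len(gamma[0])
    return [sum(row[k] for row in gamma) / N for k in range(K)]

points = [-2.0, -1.5, 0.0, 2.0, 2.5]
gamma = responsibilities(points, [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
new_weights = update_weights(gamma)
```

Each row of `gamma` sums to one, so the fractional counts $N_k$ add up to $N$ and the updated weights remain a valid probability distribution.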

The E-M algorithm for mixture density estimation is remarkably simple
once we get past the machinery used to set it up: we have an analytic update
for the weights $w_k$ and $K$ separate optimizations for each bump parameter $\theta_k$.
The miracle is that these simple updates are guaranteed to improve the
log-likelihood (from Theorem C.2). There are other ways to maximize the
likelihood, for example using gradient and Hessian based iterative optimization
techniques. However, the E-M algorithm is simpler and works well in practice.
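Example. Let’s derive the E-M update for the GMM when the current parameter
estimate is $\Theta_t = \{\hat w_1,\ldots,\hat w_K;\hat\mu_1,\ldots,\hat\mu_K;\hat\Sigma_1,\ldots,\hat\Sigma_K\}$. Let $\hat S_k = (\hat\Sigma_k)^{-1}$.
The posterior bump probabilities $\gamma_{nk}$ for the parameters $\Theta_t$ are:

$$\gamma_{nk} = \frac{\hat w_k P(\mathbf{x}_n|\hat\mu_k,\hat S_k)}{\sum_{\ell=1}^{K}\hat w_\ell P(\mathbf{x}_n|\hat\mu_\ell,\hat S_\ell)},$$

where $P(\mathbf{x}|\mu,S) = (2\pi)^{-d/2}|S|^{1/2}\exp(-\frac{1}{2}(\mathbf{x}-\mu)^{\rm T}S(\mathbf{x}-\mu))$. The $w_k$ update
is immediate from (C.6), which matches Equation (6.9). Since

$$\ln P(\mathbf{x}|\mu,S) = -\tfrac{1}{2}(\mathbf{x}-\mu)^{\rm T}S(\mathbf{x}-\mu) + \tfrac{1}{2}\ln|S| - \tfrac{d}{2}\ln(2\pi),$$

to get the $\mu_k$ and $S_k$ updates using (C.7), we need to minimize

$$\sum_{n=1}^{N}\gamma_{nk}\left((\mathbf{x}_n-\mu_k)^{\rm T}S_k(\mathbf{x}_n-\mu_k) - \ln|S_k|\right).$$

Setting the derivative with respect to $\mu_k$ to 0 gives

$$2S_k\sum_{n=1}^{N}\gamma_{nk}(\mathbf{x}_n-\mu_k) = 0.$$

Since $S_k$ is invertible, $\mu_k\sum_{n=1}^{N}\gamma_{nk} = \sum_{n=1}^{N}\gamma_{nk}\mathbf{x}_n$, and since $N_k = \sum_{n=1}^{N}\gamma_{nk}$,
we obtain the update in (6.9),

$$\mu_k = \frac{1}{N_k}\sum_{n=1}^{N}\gamma_{nk}\mathbf{x}_n.$$

To take the derivative with respect to $S_k$, we use the identities in the hint of
Exercise C.1(b). Setting the derivative with respect to $S_k$ to zero gives

$$\sum_{n=1}^{N}\gamma_{nk}(\mathbf{x}_n-\mu_k)(\mathbf{x}_n-\mu_k)^{\rm T} - S_k^{-1}\sum_{n=1}^{N}\gamma_{nk} = 0,$$

or

$$S_k^{-1} = \frac{1}{N_k}\sum_{n=1}^{N}\gamma_{nk}(\mathbf{x}_n-\mu_k)(\mathbf{x}_n-\mu_k)^{\rm T}.$$

Since $S_k^{-1} = \Sigma_k$, we recover the update in (6.9).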
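Putting the E-step and the three updates together gives a complete E-M loop for the GMM. The sketch below is a simplified illustration rather than a general implementation: it assumes one-dimensional data (so each covariance is a scalar variance), runs a fixed number of iterations, and uses synthetic data and starting values chosen arbitrarily. It also records the incomplete-data log-likelihood after every iteration so that the guarantee of Theorem C.2 can be observed.

```python
import math
import random

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(points, w, mu, var):
    """Incomplete-data log-likelihood (C.1) for a 1-D GMM."""
    return sum(math.log(sum(w[k] * gaussian_pdf(x, mu[k], var[k])
                            for k in range(len(w))))
               for x in points)

def em_gmm_1d(points, w, mu, var, n_iter=50):
    K, N = len(w), len(points)
    history = [log_likelihood(points, w, mu, var)]
    for _ in range(n_iter):
        # E-step: posterior bump probabilities gamma[n][k] under the current estimate.
        gamma = []
        for x in points:
            joint = [w[k] * gaussian_pdf(x, mu[k], var[k]) for k in range(K)]
            total = sum(joint)
            gamma.append([j / total for j in joint])
        # M-step: weights from (C.6); means and variances from (C.7).
        Nk = [sum(g[k] for g in gamma) for k in range(K)]
        w = [Nk[k] / N for k in range(K)]
        mu = [sum(g[k] * x for g, x in zip(gamma, points)) / Nk[k] for k in range(K)]
        var = [sum(g[k] * (x - mu[k]) ** 2 for g, x in zip(gamma, points)) / Nk[k]
               for k in range(K)]
        history.append(log_likelihood(points, w, mu, var))
    return w, mu, var, history

# Synthetic data: 150 points near -2 and 350 points near 3 (hypothetical values).
rng = random.Random(1)
data = ([rng.gauss(-2.0, 0.7) for _ in range(150)]
        + [rng.gauss(3.0, 1.0) for _ in range(350)])
w, mu, var, history = em_gmm_1d(data, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

The variance update uses the freshly updated mean, mirroring the derivation above, and the entries of `history` should never decrease from one iteration to the next.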


Commentary. The E-M algorithm is a remarkable example of a recurring
theme in learning. We want to learn a model $\Theta$ that explains the data. We
start with a guess $\hat\Theta$ that is wrong. We use this wrong model to estimate
some other quantities of the world (the bump memberships in our example).
We now learn a new model which is better than the old model at explaining
the combined data plus the inaccurate estimates of the other quantities.
Miraculously, by doing this, we bootstrap ourselves up to a better model, one
that is better at explaining the data. This theme reappears in Reinforcement
Learning as well. If we didn’t know better, it would seem like a free lunch.
C.2 Problems


Problem C.1 Consider the general case of Example C.1. The sample
has $N$ balls, with $N_g$ green, $N_b$ blue and $N_r$ red ($N_g + N_b + N_r = N$). Show
that the log-likelihood of the incomplete data is

$$N_g\ln(1-\mu_1) + N_b\ln(1-\mu_2) + N_r\ln(\mu_1+\mu_2) - N\ln 2. \tag{C.8}$$

What are the maximum likelihood estimates for $\mu_1, \mu_2$? (Be careful with the
cases $N_g > N/2$ and $N_b > N/2$.)


Problem C.2 Consider the general case of Example C.1 as in Problem C.1,
with $N_g$ green, $N_b$ blue and $N_r$ red balls.

(a) Suppose your starting estimates are $\hat\mu_1, \hat\mu_2$. For a red ball, what are
    $p_1 = P[\text{Bag 1}\,|\,\hat\mu_1,\hat\mu_2]$ and $p_2 = P[\text{Bag 2}\,|\,\hat\mu_1,\hat\mu_2]$?

(b) Let $N_{r_1}$ and $N_{r_2}$ (with $N_r = N_{r_1} + N_{r_2}$) be the number of red balls from
    Bag 1 and 2 respectively. Show that the log-likelihood of the complete
    data is

    $$N_g\ln(1-\mu_1) + N_b\ln(1-\mu_2) + N_{r_1}\ln(\mu_1) + N_{r_2}\ln(\mu_2) - N\ln 2.$$

(c) Compute the function $Q_t(\mu_1,\mu_2)$ by taking the expectation of the
    log-likelihood of the complete data. Show that

    $$Q_t(\mu_1,\mu_2) = N_g\ln(1-\mu_1) + N_b\ln(1-\mu_2) + p_1N_r\ln(\mu_1) + p_2N_r\ln(\mu_2) - N\ln 2.$$

(d) Maximize $Q_t(\mu_1,\mu_2)$ to obtain the E-M update.

(e) Show that repeated E-M iteration will ultimately converge to

    $$\hat\mu_1 = \frac{N-2N_g}{N} \qquad\text{and}\qquad \hat\mu_2 = \frac{N-2N_b}{N}.$$


Problem C.3 A sequence of $N$ balls, $X = x_1,\ldots,x_N$ is drawn iid as
follows. There are 2 bags. Bag 1 contains only red balls and bag 2 contains
red and blue balls. A fraction $\pi$ in this second bag are red. A bag is picked
randomly with probability $\frac{1}{2}$ and one of the balls is picked randomly from that
bag; $x_n = 1$ if ball $n$ is red and 0 if it is blue. You are given $N$ and the number
of red balls $N_r = \sum_{n=1}^{N} x_n$.

(a) (i) Show that the likelihood $P[X|\pi,N]$ is

        $$P[X|\pi,N] = \prod_{n=1}^{N}\left(\frac{1+\pi}{2}\right)^{x_n}\left(\frac{1-\pi}{2}\right)^{1-x_n}.$$

        Maximize to obtain an estimate for $\pi$ (be careful with $N_r < N/2$).
    (ii) For $N_r = 600$ and $N = 1000$, what is your estimate of $\pi$?

(b) Maximizing the likelihood is tractable for this simple problem. Now
    develop an E-M iterative approach.

    (i) What is an appropriate hidden/unmeasured variable $J = j_1,\ldots,j_N$?
    (ii) Give a formula for the likelihood for the full data, $P[X,J|\pi,N]$.
    (iii) If at step $t$ your estimate is $\pi_t$, for the expectation step, compute
        $Q_t(\pi) = E_{J|X,\pi_t}[\ln P[X,J|\pi,N]]$ and show that

        $$Q_t(\pi) = N_r\frac{\pi_t}{1+\pi_t}\ln\pi + (N-N_r)\ln(1-\pi),$$

        and hence show that the E-M update is given by

        $$\pi_{t+1} = \frac{\pi_tN_r}{\pi_tN_r + (1+\pi_t)(N-N_r)}.$$

        What are the limit points when $N_r > N/2$ and $N_r < N/2$?

    (iv) Plot $\pi_t$ versus $t$, starting from $\pi_0 = 0.9$ and $\pi_0 = 0.2$, with $N_r =
        600$, $N = 1000$.

(v) The values of the hidden variables can often be useful. After con- 
vergence of the E-M, how could you get estimates of the hidden 
variables? 
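For part (b)(iv) of Problem C.3, the trajectories to be plotted can be generated by iterating the stated update. The sketch below only produces the sequences; the function names and the number of iterations are arbitrary, and the plotting itself is left to whatever tool is at hand.

```python
def em_update(pi_t, n_red, n_total):
    """E-M update pi_{t+1} = pi_t N_r / (pi_t N_r + (1 + pi_t)(N - N_r))."""
    return pi_t * n_red / (pi_t * n_red + (1 + pi_t) * (n_total - n_red))

def run(pi0, n_red=600, n_total=1000, n_iter=200):
    """Return the sequence pi_0, pi_1, ..., pi_{n_iter}."""
    trajectory = [pi0]
    for _ in range(n_iter):
        trajectory.append(em_update(trajectory[-1], n_red, n_total))
    return trajectory

from_high = run(0.9)   # trajectory starting at pi_0 = 0.9
from_low = run(0.2)    # trajectory starting at pi_0 = 0.2
```

Comparing the two lists against the estimate from part (a)(ii) is a useful check on the answer to part (b)(iii).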


Problem C.4 [E-M for Supervised Learning] We wish to learn a
function $f(x)$ which predicts the temperature as a function of the time $x$ of
the day, $x \in [0,1]$. We believe that the temperature has a linear dependence
on time, so we model $f$ with the linear hypotheses, $h(x) = w_0 + w_1x$.

We have $N$ data points $y_1,\ldots,y_N$, the temperature measurements on different
(independent) days, where $y_n = f(x_n) + \epsilon_n$ and $\epsilon_n \sim N(0,1)$ is iid zero mean
Gaussian noise with unit variance. The problem is that we do not know the time
$x_n$ at which these measurements were made. Assume that each temperature
measurement was taken at some random time in the day, chosen uniformly on
$[0,1]$.


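(a) Show that the log-likelihood for weights $\mathbf{w}$ is

    $$\ln P[\mathbf{y}|\mathbf{w}] = \sum_{n=1}^{N}\ln\left(\int_0^1 dx\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(y_n-w_0-w_1x)^2}\right).$$

(b) What is the natural hidden variable $J = j_1,\ldots,j_N$?
(c) Compute the log-likelihood for the complete data $\ln P[\mathbf{y},J|\mathbf{w}]$.
(d) Let $\gamma_n(x|\mathbf{w}) = P(x_n = x|y_n,\mathbf{w})$. Show that, for $x \in [0,1]$,

    $$\gamma_n(x|\mathbf{w}) = \frac{\exp\left(-\frac{1}{2}(y_n-w_0-w_1x)^2\right)}{\int_0^1 dx\,\exp\left(-\frac{1}{2}(y_n-w_0-w_1x)^2\right)}.$$

    Hence, compute $Q_t(\mathbf{w})$.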






























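(e) Let $\alpha_n = E_{\gamma_n(x|\mathbf{w}_t)}[x]$ and $\beta_n = E_{\gamma_n(x|\mathbf{w}_t)}[x^2]$ (expectations taken with
    respect to the distribution $\gamma_n(x|\mathbf{w}_t)$). Show that the EM-updates are

    $$w_1(t+1) = \frac{\overline{\alpha y} - \bar\alpha\,\bar y}{\bar\beta - \bar\alpha^2}, \qquad
    w_0(t+1) = \bar y - w_1(t+1)\,\bar\alpha;$$

    where $\overline{(\cdot)}$ denotes averaging (e.g. $\bar\alpha = \frac{1}{N}\sum_{n=1}^{N}\alpha_n$) and $\mathbf{w}(t)$ are the
    weights at iteration $t$.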

(f) What happens if the temperature measurement is not at a uniformly
    random time, but at a time distributed according to an unknown $P(x)$?
    You have to maintain an estimate $P_t(x)$. Now, show that

    $$\gamma_n(x|\mathbf{w}) = \frac{P_t(x)\exp\left(-\frac{1}{2}(y_n-w_0-w_1x)^2\right)}{\int_0^1 dx\,P_t(x)\exp\left(-\frac{1}{2}(y_n-w_0-w_1x)^2\right)},$$


    and that the updates in (e) are unchanged, except that they use this new
    $\gamma_n(x|\mathbf{w}_t)$. Show that the update to $P_t$ is given by

    $$P_{t+1}(x) = \frac{1}{N}\sum_{n=1}^{N}\gamma_n(x|\mathbf{w}_t).$$

    What happens if you tried to maximize the log-likelihood for the
    incomplete data, instead of using the E-M approach?